
    Expressive Talking Head Video Encoding in StyleGAN2 Latent-Space

    While the recent advances in research on video reenactment have yielded promising results, the approaches fall short in capturing the fine, detailed, and expressive facial features (e.g., lip-pressing, mouth puckering, mouth gaping, and wrinkles) which are crucial in generating realistic animated face videos. To this end, we propose an end-to-end expressive face video encoding approach that facilitates data-efficient high-quality video re-synthesis by optimizing low-dimensional edits of a single Identity-latent. The approach builds on StyleGAN2 image inversion and multi-stage non-linear latent-space editing to generate videos that are nearly comparable to input videos. While existing StyleGAN latent-based editing techniques focus on simply generating plausible edits of static images, we automate the latent-space editing to capture the fine expressive facial deformations in a sequence of frames using an encoding that resides in the Style-latent-space (StyleSpace) of StyleGAN2. The encoding thus obtained can be superimposed on a single Identity-latent to facilitate re-enactment of face videos at $1024^2$. The proposed framework economically captures face identity, head-pose, and complex expressive facial motions at fine levels, and thereby bypasses training, person modeling, dependence on landmarks/keypoints, and low-resolution synthesis which tend to hamper most re-enactment approaches. The approach is designed with maximum data efficiency, where a single $W+$ latent and 35 parameters per frame enable high-fidelity video rendering. This pipeline can also be used for puppeteering (i.e., motion transfer).
    Comment: The project page is located at https://trevineoorloff.github.io/ExpressiveFaceVideoEncoding.io
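
    A minimal sketch of the re-synthesis idea described above: a single identity latent is inverted once, and each frame contributes only a 35-parameter edit that is superimposed on it before generation. The channel selection, the direct addition in W+ (the paper's encoding resides in StyleSpace), and the generate_image stub are illustrative assumptions, not the authors' implementation.

    import numpy as np

    NUM_LAYERS, LATENT_DIM = 18, 512           # W+ latent shape for StyleGAN2 at 1024^2
    PARAMS_PER_FRAME = 35                      # per-frame encoding size reported in the abstract

    rng = np.random.default_rng(0)
    # Hypothetical mapping: each of the 35 parameters perturbs one (layer, channel) direction.
    edit_channels = [(int(rng.integers(0, NUM_LAYERS)), int(rng.integers(0, LATENT_DIM)))
                     for _ in range(PARAMS_PER_FRAME)]

    def apply_frame_edit(w_identity, frame_params):
        """Superimpose a low-dimensional per-frame edit onto the single identity latent."""
        w_frame = w_identity.copy()
        for (layer, channel), value in zip(edit_channels, frame_params):
            w_frame[layer, channel] += value
        return w_frame

    def generate_image(w):
        """Placeholder for a pretrained StyleGAN2 generator; returns a dummy frame here."""
        return np.zeros((1024, 1024, 3), dtype=np.uint8)

    w_id = rng.standard_normal((NUM_LAYERS, LATENT_DIM))         # inverted once for the identity
    video_params = rng.standard_normal((120, PARAMS_PER_FRAME))  # 120 frames x 35 parameters
    frames = [generate_image(apply_frame_edit(w_id, p)) for p in video_params]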

    One-Shot Face Video Re-enactment using Hybrid Latent Spaces of StyleGAN2

    While recent research has progressively overcome the low-resolution constraint of one-shot face video re-enactment with the help of StyleGAN's high-fidelity portrait generation, these approaches rely on at least one of the following: explicit 2D/3D priors, optical flow based warping as motion descriptors, off-the-shelf encoders, etc., which constrain their performance (e.g., inconsistent predictions, inability to capture fine facial details and accessories, poor generalization, artifacts). We propose an end-to-end framework for simultaneously supporting face attribute edits, facial motions and deformations, and facial identity control for video generation. It employs a hybrid latent-space that encodes a given frame into a pair of latents: an Identity latent, $\mathcal{W}_{ID}$, and a Facial deformation latent, $\mathcal{S}_F$, which respectively reside in the $W+$ and $SS$ spaces of StyleGAN2, thereby incorporating the impressive editability-distortion trade-off of $W+$ and the high disentanglement properties of $SS$. These hybrid latents employ the StyleGAN2 generator to achieve high-fidelity face video re-enactment at $1024^2$. Furthermore, the model supports the generation of realistic re-enactment videos with other latent-based semantic edits (e.g., beard, age, make-up, etc.). Qualitative and quantitative analyses performed against state-of-the-art methods demonstrate the superiority of the proposed approach.
    Comment: The project page is located at https://trevineoorloff.github.io/FaceVideoReenactment_HybridLatents.io
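
    As a rough illustration of the hybrid-latent idea (not the published code), the sketch below pairs a single W+ identity latent from a source frame with a per-frame StyleSpace deformation latent from each driving frame; the encoder stubs, the StyleSpace dimensionality, and the combination interface are assumptions.

    from dataclasses import dataclass
    import numpy as np

    @dataclass
    class HybridLatents:
        w_identity: np.ndarray   # (18, 512) W+ latent carrying identity and appearance
        s_deform: np.ndarray     # flattened StyleSpace vector carrying per-frame facial deformation

    def encode_identity(source_frame):
        """Placeholder for the W+ identity encoder."""
        return np.zeros((18, 512))

    def encode_deformation(driving_frame):
        """Placeholder for the StyleSpace facial-deformation encoder."""
        return np.zeros(9088)    # assumed StyleSpace dimensionality

    def reenact(source_frame, driving_frames, generate):
        """One-shot re-enactment: identity from a single source frame, motion from each driving frame."""
        w_id = encode_identity(source_frame)
        return [generate(HybridLatents(w_id, encode_deformation(f))) for f in driving_frames]

    dummy_generate = lambda latents: np.zeros((1024, 1024, 3), dtype=np.uint8)  # stand-in generator
    video = reenact(np.zeros((256, 256, 3)), [np.zeros((256, 256, 3))] * 4, dummy_generate)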

    COVID-VTS: Fact Extraction and Verification on Short Video Platforms

    We introduce a new benchmark, COVID-VTS, for fact-checking multi-modal information involving short-duration videos with COVID-19-focused information from both the real world and machine generation. We propose TwtrDetective, an effective model incorporating cross-media consistency checking to detect token-level malicious tampering in different modalities and to generate explanations. Due to the scarcity of training data, we also develop an efficient and scalable approach to automatically generate misleading video posts by event manipulation or adversarial matching. We investigate several state-of-the-art models and demonstrate the superiority of TwtrDetective.
    Comment: 11 pages, 5 figures, accepted to EACL202
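
    The abstract does not detail the generation pipeline, but the sketch below illustrates one simple form of event manipulation: swapping a tagged named entity in a post's text while leaving the video untouched, and recording which tokens were tampered with. The entity pool, the tagging format, and the token-level labels are assumptions made purely for illustration.

    import random

    ENTITY_POOL = {"ORG": ["WHO", "CDC", "FDA"], "GPE": ["Italy", "Brazil", "Japan"]}

    def manipulate_event(caption, entities, seed=0):
        """Swap each tagged entity for a different one of the same type.

        Returns the tampered caption and the indices of the altered tokens
        (the token-level labels a tampering detector would be asked to predict)."""
        rng = random.Random(seed)
        tokens, tampered = caption.split(), []
        for surface, entity_type in entities:
            candidates = [e for e in ENTITY_POOL.get(entity_type, []) if e != surface]
            if surface not in tokens or not candidates:
                continue
            index = tokens.index(surface)
            tokens[index] = rng.choice(candidates)
            tampered.append(index)
        return " ".join(tokens), tampered

    fake_caption, positions = manipulate_event(
        "WHO reports a new variant detected in Italy",
        [("WHO", "ORG"), ("Italy", "GPE")])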

    Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning

    Despite the promising progress in multi-modal tasks, current large multi-modal models (LMMs) are prone to hallucinating inconsistent descriptions with respect to the associated image and human instructions. This paper addresses this issue by introducing the first large and diverse visual instruction tuning dataset, named Large-scale Robust Visual (LRV)-Instruction. Our dataset comprises 400k visual instructions generated by GPT4, covering 16 vision-and-language tasks with open-ended instructions and answers. Unlike existing studies that primarily focus on positive instruction samples, we design LRV-Instruction to include both positive and negative instructions for more robust visual instruction tuning. Our negative instructions are designed at three semantic levels: (i) Nonexistent Object Manipulation, (ii) Existent Object Manipulation, and (iii) Knowledge Manipulation. To efficiently measure the hallucination generated by LMMs, we propose GPT4-Assisted Visual Instruction Evaluation (GAVIE), a stable approach to evaluating visual instruction tuning like human experts. GAVIE does not require human-annotated ground-truth answers and can adapt to diverse instruction formats. We conduct comprehensive experiments to investigate the hallucination of LMMs. Our results demonstrate that existing LMMs exhibit significant hallucination when presented with our negative instructions, particularly Existent Object and Knowledge Manipulation instructions. Moreover, we successfully mitigate hallucination by finetuning MiniGPT4 and mPLUG-Owl on LRV-Instruction while improving performance on several public datasets compared to state-of-the-art methods. Additionally, we observe that a balanced ratio of positive and negative instances in the training data leads to a more robust model.
    Comment: 40 pages, 32 figures. Under Review
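
    A minimal sketch of how instruction samples mixing the positive case and the three negative levels described above might be organized; the field names, example texts, and balancing note are assumptions rather than the released dataset schema.

    from dataclasses import dataclass
    from enum import Enum
    from typing import Optional

    class NegativeType(Enum):
        NONEXISTENT_OBJECT = "nonexistent object manipulation"
        EXISTENT_OBJECT = "existent object manipulation"
        KNOWLEDGE = "knowledge manipulation"

    @dataclass
    class InstructionSample:
        image_id: str
        instruction: str
        answer: str
        is_negative: bool
        negative_type: Optional[NegativeType] = None

    samples = [
        InstructionSample("img_001", "What color is the dog on the sofa?",
                          "There is no dog in the image.", True, NegativeType.NONEXISTENT_OBJECT),
        InstructionSample("img_001", "Describe the sofa in the image.",
                          "A grey two-seat sofa placed by the window.", False),
    ]
    # A balanced ratio of positive and negative samples (which the abstract reports improves
    # robustness) can then be enforced when batches are drawn for instruction tuning.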

    Human Emotion Recognition from Motion Using a Radial Basis Function Network Architecture

    (Also cross-referenced as CAR-TR-721) In this paper a radial basis function network architecture is developed that learns the correlation between facial feature motion patterns and human emotions. We describe a hierarchical approach which at the highest level identifies emotions, at the mid level determines motions of facial features, and at the low level recovers motion directions. Individual emotion networks were trained to recognize the "smile" and "surprise" emotions. Each network was trained by viewing a set of sequences of one emotion for many subjects. The trained neural network was then tested for retention, extrapolation and rejection ability. Success rates were about 88% for retention, 73% for extrapolation, and 79% for rejection.
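
    For readers unfamiliar with the architecture, a generic radial basis function network is sketched below (Gaussian hidden units over motion-descriptor vectors, linear output weights fit by least squares); the feature dimensionality, number of centres, and synthetic labels are illustrative assumptions, not the networks trained in the paper.

    import numpy as np

    def rbf_features(X, centres, width):
        """Gaussian activations of each sample with respect to each centre."""
        sq_dist = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(-1)
        return np.exp(-sq_dist / (2.0 * width ** 2))

    rng = np.random.default_rng(0)
    X = rng.standard_normal((200, 12))                   # toy 12-D motion descriptors per sequence
    y = (X[:, 0] + X[:, 1] > 0).astype(float)            # toy binary labels standing in for smile/surprise
    centres = X[rng.choice(len(X), 20, replace=False)]   # 20 hidden units, centres sampled from the data
    H = rbf_features(X, centres, width=1.5)
    W, *_ = np.linalg.lstsq(np.c_[H, np.ones(len(H))], y, rcond=None)   # output weights plus bias

    def predict(X_new):
        H_new = rbf_features(X_new, centres, width=1.5)
        return (np.c_[H_new, np.ones(len(H_new))] @ W > 0.5).astype(int)

    train_accuracy = (predict(X) == y).mean()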

    Temporal Multi-scale Models for Flow and Acceleration

    A model for computing image flow in image sequences containing a very wide range of instantaneous flows is proposed. This model integrates the spatio-temporal image derivatives from multiple temporal scales to provide both reliable and accurate instantaneous flow estimates. The integration employs robust regression and automatic scale weighting in a generalized brightness constancy framework. In addition to instantaneous flow estimation, the model supports recovery of dense estimates of image acceleration and can be readily combined with parameterized flow and acceleration models. A demonstration of performance on image sequences of typical human actions taken with a high frame-rate camera is given.
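
    To make the idea of pooling brightness-constancy constraints across temporal scales concrete, here is a single-patch sketch under simplifying assumptions; the scale set, the Lorentzian-style weighting, and the synthetic sequence are illustrative and not the paper's estimator.

    import numpy as np

    def patch_flow_multiscale(frames, y, x, half=4, scales=(1, 2, 4), sigma=1.0, iters=5):
        """Estimate (u, v) for the patch centred at (y, x) from constraints at several temporal scales."""
        f0 = frames[0].astype(float)
        window = np.s_[y - half:y + half + 1, x - half:x + half + 1]
        Iy, Ix = np.gradient(f0)                           # spatial derivatives of the reference frame
        A_rows, b_rows = [], []
        for s in scales:
            It = (frames[s].astype(float) - f0) / s        # temporal derivative at scale s
            A_rows.append(np.stack([Ix[window].ravel(), Iy[window].ravel()], axis=1))
            b_rows.append(-It[window].ravel())
        A, b = np.vstack(A_rows), np.concatenate(b_rows)
        uv = np.zeros(2)
        for _ in range(iters):                             # iteratively reweighted least squares
            residual = A @ uv - b
            weight = np.sqrt(1.0 / (1.0 + (residual / sigma) ** 2))   # down-weight outlier constraints
            uv = np.linalg.lstsq(A * weight[:, None], b * weight, rcond=None)[0]
        return uv

    yy, xx = np.mgrid[0:64, 0:64]
    base = np.sin(xx / 6.0) + np.cos(yy / 9.0)             # smooth toy texture
    frames = [np.roll(base, t, axis=1) for t in range(5)]  # constant one-pixel-per-frame horizontal motion
    u, v = patch_flow_multiscale(frames, 32, 32)           # u should come out close to 1, v close to 0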

    Parameterized Modeling and Recognition of Activities

    In this paper we consider a class of human activities (atomic activities) which can be represented as a set of measurements over a finite temporal window (e.g., the motion of human body parts during a walking cycle) and which has a relatively small space of variations in performance. A new approach for modeling and recognition of atomic activities, which employs principal component analysis and analytical global transformations, is proposed. The modeling of sets of exemplar instances of activities that are similar in duration and involve similar body part motions is achieved by parameterizing their representation using principal component analysis. The recognition of variants of modeled activities is achieved by searching the space of admissible parameterized transformations that these activities can undergo. This formulation iteratively refines the recognition of the class to which the observed activity belongs and the transformation parameters that relate it to the model in its class.
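
    As a rough illustration of PCA-based activity modeling (the search over analytical global transformations described above is omitted), the sketch below fits a per-class principal subspace to exemplar measurement vectors and labels a new observation by reconstruction error; the dimensions and synthetic exemplars are assumptions.

    import numpy as np

    def fit_pca(exemplars, k=3):
        """exemplars: (n, d) array of flattened measurements over a fixed temporal window."""
        mean = exemplars.mean(axis=0)
        _, _, Vt = np.linalg.svd(exemplars - mean, full_matrices=False)
        return mean, Vt[:k]                                  # class mean and top-k principal basis

    def reconstruction_error(x, model):
        mean, basis = model
        coefficients = basis @ (x - mean)                    # project onto the class subspace
        return np.linalg.norm((x - mean) - basis.T @ coefficients)

    rng = np.random.default_rng(0)
    activity_models = {
        "walk": fit_pca(rng.standard_normal((20, 60))),      # 20 exemplars, 60-D measurement window
        "run":  fit_pca(rng.standard_normal((20, 60)) + 2.0),
    }
    observation = rng.standard_normal(60) + 2.0
    label = min(activity_models, key=lambda c: reconstruction_error(observation, activity_models[c]))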

    Tracking Rigid Motion using a Compact-Structure Constraint

    An approach for tracking the motion of a rigid object using parameterized flow models and a compact-structure constraint is proposed. While polynomial parameterized flow models have been shown to be effective in tracking the rigid motion of planar objects, these models are inappropriate for tracking moving objects that change appearance, revealing their 3D structure. We extend these models by adding a structure-compactness constraint that accounts for image motion that deviates from a planar structure. The constraint is based on the assumption that object structure variations are limited with respect to planar object projection onto the image plane and can therefore be expressed as a direct constraint on the image motion. The performance of the algorithm is demonstrated on several long image sequences of rigidly moving objects.
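
    A minimal sketch of the kind of parameterized (affine) flow model such trackers build on, estimated by least squares over a tracked region; the compact-structure term itself is not reproduced here, and the region, the affine basis, and the toy frames are assumptions.

    import numpy as np

    def affine_flow(frame0, frame1, box):
        """Least-squares affine motion (u = a0 + a1*x + a2*y, v = a3 + a4*x + a5*y) inside `box`."""
        y0, y1, x0, x1 = box
        I0 = frame0.astype(float)
        Iy, Ix = np.gradient(I0)                              # spatial derivatives
        It = frame1.astype(float) - I0                        # temporal derivative
        ys, xs = np.mgrid[y0:y1, x0:x1]
        ix, iy, it = Ix[y0:y1, x0:x1].ravel(), Iy[y0:y1, x0:x1].ravel(), It[y0:y1, x0:x1].ravel()
        x, y = xs.ravel().astype(float), ys.ravel().astype(float)
        # Brightness constancy: ix*u + iy*v + it = 0, with u and v linear in (x, y).
        A = np.stack([ix, ix * x, ix * y, iy, iy * x, iy * y], axis=1)
        params, *_ = np.linalg.lstsq(A, -it, rcond=None)
        return params                                          # (a0, ..., a5)

    yy, xx = np.mgrid[0:64, 0:64]
    frame0 = np.sin(xx / 6.0) + np.cos(yy / 9.0)               # smooth toy texture
    frame1 = np.roll(frame0, 1, axis=1)                        # content shifted right by one pixel
    a = affine_flow(frame0, frame1, (16, 48, 16, 48))          # a[0] should be close to 1 (pure translation)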